DOCUMENT-IDENTIFIER : US 6128720 A 

TITLE: Distributed processing array with component processors performing customized 
interpretation of instructions 

Brief Summary Text (4) : 

Three standard techniques are used in parallel processing architectures: 
pipelining, the use of multiple simple processing elements, and multi-processing. 
Pipelines are used to achieve time parallelism, where the independent steps in 
multi-cycle operations are executed in parallel, thereby improving performance. An 
array of Processing Elements (PEs), termed an array processor, achieves a physical 
parallelism through operating the multiple elements synchronously in parallel. A 
multi-processing system achieves an asynchronous parallelism with multiple 
communicating processes executing on independent systems. This invention is 
concerned with the pipeline control of multiple processing elements connected 
together in some topology. 

Brief Summary Text (12) : 

Still further, the customized interpretation of a broadcast instruction by a 
particular PE can be used by that processor in directing the results of the 
arithmetic computations to other PEs in the array. Each PE executing an instruction 
in its customized mode will drive steering information to steer the results of the 
computation from that PE to another selected PE in the array. 
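
The steering behavior described above can be illustrated with a minimal sketch. It 
assumes a four-PE array, a single broadcast "double the operand" instruction, and a 
per-PE steering value named dest_pe; these names and the instruction itself are 
hypothetical and are not taken from the patent. 

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PE:
    pe_id: int
    operand: int
    dest_pe: int                    # hypothetical steering value held locally by this PE
    received: Optional[int] = None  # result steered here by some other PE


def broadcast_double(pes: List[PE]) -> None:
    """Broadcast one instruction to every PE; each PE steers its own result."""
    # All PEs compute first, then all results are delivered, which mimics
    # synchronous operation of the array.
    results = [(pe.dest_pe, pe.operand * 2) for pe in pes]
    for dest, value in results:
        pes[dest].received = value


# Each PE interprets the same instruction but sends its result to a different PE.
pes = [PE(0, 10, dest_pe=1), PE(1, 20, dest_pe=2),
       PE(2, 30, dest_pe=3), PE(3, 40, dest_pe=0)]
broadcast_double(pes)
print([pe.received for pe in pes])
```

The point of the sketch is only that the instruction is identical for all PEs while 
the destination of each result is decided locally. 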

Detailed Description Text (7) : 

Turning now to FIG. 2, a more detailed illustration is given of a processor element 
120. There it is seen that the processor element 120 includes an instruction buffer 
202 which is connected to the instruction bus 110. The instruction buffer 202 is 
then connected to an instruction register and decode 204. A data buffer 208 is 
connected to the data bus 112 through the interconnection switch 206. The data 
buffer 208 is connected to the multiplexer 210. A general purpose register file 212 
is connected to the multiplexer 210 and to a selector 214. A plurality of 
arithmetic units 216 perform various arithmetic functions FN1, FN2, through FNX. 
The local processor element interconnection switch 206 selectively connects the 
processor element 120 to various links 205 which connect to other processor elements 120 in 
the array 100. A mode register 207 is connected to the instruction register 204 and 
through the instruction buffer 202 to the instruction bus 110, for storing a 
topology configuration value from the topology field 302. 

Detailed Description Text (8) : 

Reference can be made to FIG. 5 which shows the instruction pipeline which includes 
components from the control unit 108 and the processing element 120. Along the left 
hand margin of FIG. 5 is a vertical line showing the phase stages for a first 
embodiment of the invention for arrays of processing elements which are 9×9 
or fewer. A corresponding illustration is shown in FIG. 6 for arrays of processor 
elements which are greater than 10×10. In FIG. 5, it is seen that the phases 
are divided into a first instruction fetch/distribute phase, followed by a decode 
phase, followed by an execute and communicate phase which is followed by a 
condition code return phase. In FIG. 6, it is seen that the phases are divided into 
a first instruction fetch phase, followed by a distinct distribute phase, which is 
followed by a decode phase, which is followed by an execute and communicate phase, 
which is followed by a condition code phase. The distinct distribute phase shown in 
FIG. 6 is provided for larger arrays of 10×10 processing elements or more, in 
order to enable additional time to distribute instructions among the various 
sequencers and processing elements. 
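
As a rough sketch of the choice between the two pipelines, the following selects a 
phase list from the size of a square n×n array. The function name pipeline_phases 
and the split_at parameter are assumptions made for illustration; the cut-over at 
10×10 follows the wording of this paragraph and would in practice also depend on the 
intended cycle time. 

```python
from typing import List

FOUR_PHASE = ["fetch/distribute", "decode",
              "execute/communicate", "condition code return"]
FIVE_PHASE = ["fetch", "distribute", "decode",
              "execute/communicate", "condition code return"]


def pipeline_phases(n: int, split_at: int = 10) -> List[str]:
    """Phases for an n x n PE array; split_at is an illustrative cut-over size."""
    if n < 1:
        raise ValueError("array dimension must be positive")
    # Larger arrays get a distinct distribute phase so that instructions have
    # an extra cycle to reach every sequencer and processing element.
    return FIVE_PHASE if n >= split_at else FOUR_PHASE


for n in (4, 9, 10, 16):
    print(f"{n}x{n}: {len(pipeline_phases(n))} phases")
```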

Detailed Description Text (19) : 

FIG. 1 depicts a high level view of the array processor machine organization. The 
machine organization is partitioned into three main parts: the System Interfaces 
including Global Memory and external I/O, multiple Control Units with Local Memory, 
and the Execution Array with Distributed Control PEs. The System Interface is an 
application-dependent interface through which the array processor interfaces with 
Global Memory, the I/O, other system processors, and the personal 
computer/workstation host. Consequently, the System Interface will vary depending 
upon the application and the overall system design. The Control Units contain the 
local memory for instruction and data storage, instruction fetch (I-Fetch) 
mechanisms, and operand or data fetch mechanisms (D-Fetch). The Execution Array 
with Distributed Control PEs is a computational topology of processing elements 
chosen for a particular application. For example, the array may consist of N 
Processing Elements (PEs) per control unit, with each PE containing an Instruction 
Buffer (IBFR), a General Purpose Register File (GPRF), Functional Execution units 
(FNS), Communication Facilities (COM), and interfaces to its Instruction/Data 
buses. The PEs may also contain PE-local instruction and data memories. Further, 
each PE contains an instruction decode register which supports distributed control 
of the multiple PEs. Synchronism of local memory accessing is a cooperative process 
between the control units, local memories, and the PEs. The array of PEs allows 
computation functions (FNS) to be executed in parallel in the PEs and results to be 
communicated (COM) between PEs. 

Detailed Description Text (21) : 

FIG. 2 depicts a generalized PE. The PE contains a COM facility identified as the 
local interconnection switch network. This COM facility provides the means for 
interfacing the GPRF and the arithmetic elements with neighboring PEs via the 
Links. The COM facility also provides the connecting interface to the control unit's 
local memory subsystem. The general philosophy of the instruction pipeline is for 
the Control Units, also termed Sequence Processors (SPs), to access instructions 
and pass on to the PEs any instructions designated to go there. This can be 
accomplished by use of the tagging of instructions. In a multiple PE organization 
there is a need to load single and multiple PEs, store register/status registers 
from single PEs to memory, and control the PEs in different topologies. Rather than 
proliferate opcodes to accomplish these tasks, tags are created and concatenated to 
the instructions for PE decode and control. Tags operate as a mode control 
extension field to the instruction formats. By use of the VLIW concept, the 
operating modes can be changed on a cycle-by-cycle basis if required. Since the tag 
is generated from information stored in a special purpose register, its definition 
can be machine dependent allowing smaller tags for small machine organizations and 
larger tags for larger organizations. FIG. 3 depicts a generic form for the tag. 
The instructions executed by the Processing Elements contain 32 bits plus a system 
dependent tag field. As an example, the Instruction Tag bits can convey specific 
status and mode information registered in the SP to the PEs, as well as specific PE 
identifier values to support individual loading of the PEs. All instructions are 
defined as broadcast operations going to all PEs associated with a specific control 
unit's instruction bus. Specific tagged compute instructions are controlled by the 
tag field. If tagged compute is not specified all PEs execute the instruction 
independent of the tag-code field. 
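
The tagged-compute rule in the last two sentences can be sketched as follows. The 
field names tagged_compute and tag_code, and the use of the tag code as a PE 
identifier to be matched, are assumptions for illustration; the text fixes only the 
32-bit instruction width and states that the tag layout is system dependent. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedInstruction:
    word: int             # the 32-bit instruction proper
    tagged_compute: bool  # hypothetical flag: is tagged compute specified?
    tag_code: int         # hypothetical tag-code field, here a PE identifier


def pe_executes(pe_id: int, instr: TaggedInstruction) -> bool:
    """Decide whether a PE acts on a broadcast instruction."""
    if not instr.tagged_compute:
        # Tagged compute not specified: every PE executes, whatever the tag code.
        return True
    # Assumed selection rule: only the PE named by the tag code executes.
    return instr.tag_code == pe_id


everyone = TaggedInstruction(word=0x12345678, tagged_compute=False, tag_code=2)
only_pe2 = TaggedInstruction(word=0x12345678, tagged_compute=True, tag_code=2)
print([pe_executes(pe, everyone) for pe in range(4)])
print([pe_executes(pe, only_pe2) for pe in range(4)])
```

Both instructions are still broadcast to all four PEs; the tag changes only how each 
PE interprets what it receives. 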

Detailed Description Text (23) : 

In many processors, a three-phase fetch, decode, and execute pipeline is used for 
the basic instruction execution control where the instruction fetched is received 
in a single instruction decode unit. This requires that an instruction be fetched 
from one of N instruction memories and then be distributed to N sequencers and PEs 
with both the sequencers and PEs decoding and executing in synchronism the received 
instructions. Depending upon topology size and the intended cycle time, the 
fetching and distributing of instructions can be accomplished in either a combined 
fetch/distribute cycle or in separate fetch and distribute cycles. For scalable 
topologies under consideration of 2×2 up to 10×10 it is envisioned that 
a combined fetch/distribute cycle is appropriate. In order to handle relatively 
high usage arithmetic conditional branch and PE generated exception conditional 
branch operations, a separate exception condition return phase is provided and two 
branch execution timings are architected. FIG. 4A and FIG. 5 show two views of the 
four-phase pipe: fetch/distribute, decode, execute, and condition code return (see 
Table 1). Table 1 shows a four-phase instruction pipeline diagram example which is 
depicted in FIG. 5. FIG. 4B and FIG. 6 depict a five-phase pipeline for larger 
topologies with fast cycle times (see Table 2). Table 2 shows a five-phase 
instruction pipeline diagram example which is depicted in FIG. 6. 
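
A pipeline diagram of the kind referred to here can be generated with a short 
sketch. It assumes that one instruction enters the pipe per cycle and that there are 
no stalls or branches, which is a simplification. The mnemonics F/D, FET, DIST, and 
DEC are shorthand chosen for this example; EX/COM and CCR follow the labels used in 
the pipeline table. 

```python
from typing import List


def pipeline_diagram(phases: List[str], n_instr: int) -> List[List[str]]:
    """Row i gives, cycle by cycle, the phase instruction i occupies ('' if idle)."""
    n_cycles = n_instr + len(phases) - 1
    rows = []
    for i in range(n_instr):
        row = [""] * n_cycles
        for offset, name in enumerate(phases):
            # Instruction i enters the first phase in cycle i (assumed: no stalls).
            row[i + offset] = name
        rows.append(row)
    return rows


four_phase = ["F/D", "DEC", "EX/COM", "CCR"]
five_phase = ["FET", "DIST", "DEC", "EX/COM", "CCR"]

for label, phases in (("four-phase", four_phase), ("five-phase", five_phase)):
    print(label)
    for i, row in enumerate(pipeline_diagram(phases, 3)):
        print(f"  I{i}: " + " ".join(f"{cell:7}" for cell in row))
```

Under these assumptions, three instructions complete in six cycles in the four-phase 
pipe and in seven cycles in the five-phase pipe. 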

Detailed Description Text (25) : 

1. It allows the maximum possible time in the paths from (on-chip) instruction 
memory to the Sequencer, and from Sequencer to PEs. 

Detailed Description Paragraph Table (2) : 

perform, a forced NOP is generated during this phase. At the end of this phase, the 
decoded instruction is latched into the SP Execute Register (SXR) and PE Execute 
Register (PXR). 

EX/COM Instruction Execute: During this phase, the decoded instructions in the 
Execute Registers are executed and the results are communicated to the DEST target 
registers. Note that an instruction will execute in the SP and its associated PEs at 
the same time. 

CCR Condition Code Return: A condition code is returned from PEs to the sequencer 
at the end of the CCR phase. 

CLAIMS : 

22. The method of claim 20, which further comprises: 

said fetching step and said distributing step being performed in a single machine 
cycle. 

23. The method of claim 20, which further comprises: 

said fetching step and said distributing step being performed in separate, 
consecutive machine cycles. 